Protein Data Bank Archives its 100,000th Molecule Structure

International Research Archive Co-Hosted by SDSC Doubles in Size Since 2008

Published 05/14/2014

The number of structures available in the PDB per year, as of May 14, 2014. Highlighted examples include myoglobin (1; PDB ID 1mbn), the first structure solved by X-ray crystallography, and small enzymes (2; top: 4pti, bottom right: 2cha, bottom left: 3cpa). As technologies developed, the archive grew to host examples of tRNA (3; 6tna), viruses (4; 4rhv), antibodies (5; 1igt), protein-DNA complexes (6; top to bottom, 1j59, 1tro, 2bop, 1aoi), ribosomes (7; 1fjg, 1fka, 1ffk), and chaperones (8; 1aon). Image courtesy of wwPDB

The Protein Data Bank (PDB), the single worldwide repository for the three-dimensional structures of large biological molecules, including proteins and nucleic acids, that are vital to pharmacology and bioinformatics research, recently archived its 100,000th molecule structure, doubling its size in just six years.

Four data centers, including one co-located at Rutgers, The State University of New Jersey, and the San Diego Supercomputer Center (SDSC)/Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California, San Diego, support online access to the three-dimensional structures of biological macromolecules. These structures help researchers understand many facets of biomedicine, agriculture, and ecology, from protein synthesis to health and disease to biological energy.

Established in 1971, this central, public archive of experimentally-determined protein and nucleic acid structures has reached this critical milestone thanks to the efforts of structural biologists throughout the world. 

“The PDB is a critical resource for the international community of working scientists which includes everyone from geneticists to pharmaceutical companies interested in drug targets,” said Nobel Laureate Venki Ramakrishnan, of the MRC Laboratory of Molecular Biology in Cambridge, England, in a wwPDB release marking the milestone this week.

“SDSC has provided safe haven for the PDB since it arrived at UC San Diego in the late 1990s, along with Phil Bourne,” said SDSC Director Michael Norman. “It was the project that initially got us involved in data science, and it remains an important element in our ‘Big Data’ strategy. I congratulate the PDB project for their success and achieving this significant milestone.” 

Bourne recently joined the National Institutes of Health as the Associate Director for Data Science. He formerly was Associate Vice Chancellor for Innovation and Industry Alliances, a Professor in the Department of Pharmacology and Skaggs School of Pharmacy and Pharmaceutical Sciences at UC San Diego, an SDSC Distinguished Scientist, as well as Associate Director of the RCSB (Research Collaboratory for Structural Bioinformatics) PDB. 

Function Follows Form
In the 1950s, scientists had their first direct look at the structures of proteins and DNA at the atomic level. Determination of these early three-dimensional structures by X-ray crystallography ushered in a new era in biology, one driven by the intimate link between form and biological function. As the value of archiving and sharing these data was quickly recognized by the scientific community, the PDB was established as the first open access digital resource in all of biology by an international collaboration in 1971, with data centers located in the U.S. and the United Kingdom.

Among the first structures deposited in the PDB were those of myoglobin and hemoglobin, two oxygen-binding molecules whose structures were elucidated by Chemistry Nobel Laureates John Kendrew and Max Perutz. With this week's regular update, the PDB welcomes 219 new structures into the archive. These structures join others vital to drug discovery, bioinformatics, and education, for a total of 100,147 entries. 

The PDB releases approximately 200 new structures to the scientific community every week. The resource is accessed hundreds of millions of times annually by researchers, students, and educators intent on exploring how different proteins are related to one another, clarifying fundamental biological mechanisms, and discovering new medicines.

Future Challenges
As the scientific community eagerly awaits many more structures to be deposited in the PDB, along with the invaluable knowledge these additions will bring, the increasing number, size, and complexity of these data constitute major challenges for the management of the archive. Earlier this year, the wwPDB launched a new Deposition and Annotation System designed to meet the evolving needs of the scientific community over the next decade. Since its initial launch, more than 750 X-ray crystallographic structures from 30 countries have been deposited using the new system.

About the wwPDB
The wwPDB is the international partnership of four data centers that manage the PDB archive. Its mission is to maintain a single archive of macromolecular structural data that is freely and publicly available to the global community. It consists of the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), co-located at Rutgers, The State University of New Jersey, and the San Diego Supercomputer Center and Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California, San Diego; BioMagResBank (BMRB) at the University of Wisconsin; the Protein Data Bank in Europe (PDBe) at the EMBL European Bioinformatics Institute; and the Protein Data Bank Japan (PDBj) at Osaka University. The RCSB PDB receives funds from the National Science Foundation, National Institutes of Health, and the U.S. Department of Energy (DOE).

About SDSC
As an Organized Research Unit of UC San Diego, SDSC is considered a leader in data-intensive computing and cyberinfrastructure, providing resources, services, and expertise to the national research community, including industry and academia. Cyberinfrastructure refers to an accessible, integrated network of computer-based resources and expertise, focused on accelerating scientific inquiry and discovery. SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from earth sciences and biology to astrophysics, bioinformatics, and health IT. With its two newest supercomputers, Trestles and Gordon, and a new system called Comet to be deployed in early 2015, SDSC is a partner in XSEDE (Extreme Science and Engineering Discovery Environment), the most advanced collection of integrated digital resources and services in the world.

Media Contacts:
Jan Zverina, SDSC Communications
858 534-5111 or jzverina@sdsc.edu

Warren R. Froelich, SDSC Communications
858 822-3622 or froelich@sdsc.edu

Christine Zardecki, RCSB Protein Data Bank, Rutgers,
The State University of New Jersey
848 445-0103 or info@rcsb.org

Related Links

Worldwide Protein Data Bank: http://wwpdb.org
San Diego Supercomputer Center: http://sdsc.edu/
Rutgers, the State University of New Jersey: http://rutgers.edu/
National Science Foundation: http://nsf.gov/